A New Variables Selection and Dimensionality Reduction Technique Coupled with Simca Method for the Classification of Text Documents
Abstract
Classification of text documents is of significant importance in data mining and machine learning. However, representing documents as vectors for classification yields highly sparse data with an immense number of variables. This calls for an efficient variable selection and dimensionality reduction technique that preserves the model’s selectivity, accuracy, and robustness with fewer variables. This paper introduces a new coefficient, the Variables Strength Coefficient (VSC), which retains variables with strong modeling and discriminatory power. A variable whose VSC exceeds a predefined threshold is considered strong both in modeling the data and in discriminating between classes and is retained; weaker variables are discarded. This straightforward technique maximizes the differences between classes while preserving the modeling power of the variables. The paper also proposes applying SIMCA, a supervised learning algorithm widely used in chemical analysis, as the classifier. The soft and independent nature of SIMCA allows multi-labeling of text documents and makes it possible to add new classes later without affecting the existing model. VSC-SIMCA was applied to the ‘CNAE-9’ data set, and the results were compared with classification and dimensionality reduction work on the same data set reported in the literature. VSC-SIMCA outperforms the other techniques in both the amount of dimensionality reduction and classification performance. The improved classification precision, achieved with substantially fewer variables, demonstrates the contribution of the proposed approach.
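The soft, independent class modeling that the abstract attributes to SIMCA can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the classification step (the VSC formula is not given in the abstract, so variable selection is left out), it uses synthetic data instead of CNAE-9, and it replaces the usual F-test with a fixed residual-ratio cutoff. The names `ClassModel`, `soft_labels`, `make_class`, and the cutoff value 3.0 are all assumptions made for illustration.

```python
import numpy as np


class ClassModel:
    """PCA model of a single class (simplified SIMCA class model)."""

    def __init__(self, X, n_components=2):
        self.mean = X.mean(axis=0)
        Xc = X - self.mean
        # Principal axes of the class from the SVD of its centered data.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        self.P = Vt[:n_components].T  # loadings, shape (n_vars, k)
        # Pooled residual standard deviation of the training samples.
        resid = Xc - Xc @ self.P @ self.P.T
        n, p = X.shape
        dof = max((n - n_components - 1) * (p - n_components), 1)
        self.s0 = np.sqrt((resid ** 2).sum() / dof)

    def distance(self, x):
        """Residual std of x relative to the class's own residual std."""
        d = x - self.mean
        r = d - d @ self.P @ self.P.T
        si = np.sqrt((r ** 2).sum() / (x.size - self.P.shape[1]))
        return si / self.s0


def soft_labels(models, x, cutoff=3.0):
    """Return every class whose model accepts x (possibly none or several).

    The fixed cutoff is an illustrative assumption, not an F-test.
    """
    return [name for name, m in models.items() if m.distance(x) < cutoff]


rng = np.random.default_rng(0)


def make_class(center, axes, n=40):
    """Synthetic class: small noise everywhere, large spread along `axes`."""
    X = center + 0.1 * rng.normal(size=(n, 5))
    X[:, axes] += 3.0 * rng.normal(size=(n, len(axes)))
    return X


# Each class gets its own model, fitted without looking at the others.
models = {
    "A": ClassModel(make_class(0.0, [0, 1])),
    "B": ClassModel(make_class(10.0, [2, 3])),
}

labels_near_a = soft_labels(models, np.zeros(5))
labels_outlier = soft_labels(models, np.array([0.0, 0.0, 50.0, 50.0, 50.0]))

# A new class is added later; the existing class models are not refitted.
models["C"] = ClassModel(make_class(-10.0, [0, 4]))
labels_after_add = soft_labels(models, np.zeros(5))
```

Because every class is modeled separately, a sample is tested against each model in turn: it may be accepted by one class, by several (multi-labeling), or by none, and adding class `C` leaves the models for `A` and `B` untouched.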
Related articles
An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification
Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...
An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...